Abstract

Learning Mahalanobis distance metrics in a high-dimensional
feature space is very difficult, especially when structural
sparsity and low rank are enforced to improve computational
efficiency in the testing phase. This paper addresses both
aspects with an ensemble metric learning approach that
consists of two consecutive steps: sparse block diagonal
metric ensembling and joint metric learning. The former step
pursues a highly sparse block diagonal metric by selecting
effective feature groups, while the latter further exploits
correlations between the selected feature groups to obtain an
accurate and low-rank metric. Our algorithm considers all
pairwise or triplet constraints generated from training
samples with explicit class labels, and scales well with
increasing feature dimensionality and growing data volumes.
In face verification and retrieval, it outperforms existing
state-of-the-art methods in accuracy while retaining high
efficiency.



1. Introduction 

Similarity measurement has been studied intensively in 
fields such as machine learning, information retrieval, ar- 
tificial intelligence, and cognitive science for a long time; 
it plays a crucial role in learning, reasoning and predicting 
as similar things usually behave similarly [10]. Consid- 
er machine learning algorithms such as K-means, nearest- 
neighbors classifiers and kernel methods - their performance 
critically relies on being given a good metric over the input 
space [23]. Metric learning aims at finding appropriate simi- 
larity measurements between any pair of instances that pre- 
serve desired distance structure. This is not only fundamental 
to understanding high level concepts such as categories, but 
also necessary for many low level tasks such as verification, 
clustering and retrieval. Especially in recent years, many 
supervised metric learning algorithms have been proposed 
to learn Mahalanobis distance metrics for clustering or
k-nearest neighbor classification, which can be divided into
two categories according to supervision types. The first
category is weakly supervised and learns metrics from directly
provided pairwise constraints between instances. Examples 
are Xing et al.'s pioneering distance metric learning method 
[23] and Davis et al.'s Information-Theoretic Metric Learn- 
ing (ITML) [3]. Such weak constraints are also known as 
side information [23]. The second category is strongly super- 
vised which requires explicit class labels assigned to every 
instance and generates a potentially large number of con- 
straints between them. It includes Globerson and Roweis's 
Metric Learning by Collapsing Classes (MCML) [7] and 
Weinberger et al.'s Large Margin Nearest Neighbor (LMNN) 
[22]. These supervised methods generally perform well in 
data sets with up to hundreds of features, but are still very 
limited in tasks with high dimensional data. Usually, the 
dimensionality is reduced by Principal Component Analysis
(PCA) beforehand, which can allow noise to overwhelm
signal that is useful for supervised metric learning as PCA 
is essentially an unsupervised metric learning method. This 
problem becomes even more serious when using overcom- 
plete representations of data where huge redundancy needs 
to be addressed carefully. 

Overcomplete representations of data possess great ro- 
bustness in the presence of noise and other forms of degra- 
dations and thus, are better suited to subsequent processing 
[17]. In this paper, we propose a strongly supervised
ensemble metric learning approach for low-rank Mahalanobis
distance metrics based on a sparse combination of features
from an overcomplete set. It consists of two consecutive
steps: sparse block diagonal metric ensembling and joint
metric learning. The former step sequentially selects
effective features and learns their associated weak metrics,
which correspond to diagonal blocks of the Mahalanobis matrix
in the entire feature space. The latter step learns another
Mahalanobis distance metric in the feature subspace enabled
by the former step, by jointly considering the already
selected features with an optional low-rank constraint, so as
to represent all instances in an even lower-dimensional space.
This two-step approach can be viewed as supervised sparse
dimensionality reduction followed by supervised metric
learning.

Unlike previous metric learning methods, our implementation
adopts a convex smooth loss function based on an exponential
logit surrogate function. We also develop an efficient batch
learning algorithm for this strongly supervised approach to
enable simultaneous handling of all pairwise or triplet
constraints generated from a large amount of data with
explicit class labels.

We validate our approach on the Labeled Faces in the Wild
(LFW) data set [13] for face verification and another much 
larger data set for face retrieval. In the unrestricted configura- 
tion of LFW, our approach improves the mean classification 
accuracy from the previous record 91.30% (by a commercial 
system from face.com [19]) to 92.58% without using any out- 
side data, heuristic knowledge or 3D face models. For face 
retrieval in the extended data set, our approach encodes each 
face image into a highly discriminative 150-dimensional vec- 
tor. This retains high retrieval accuracy, while achieving 
significantly lower computational cost compared to
state-of-the-art methods (such as LMNN [22]).

The contributions of this paper are three-fold: 

• We introduce group sparsity into metric learning
problems with overcomplete representations of data.

• We develop a two-step batch learning algorithm for ef- 
ficient learning of such metrics in large scale problems. 

• Our algorithm considerably improves the best perfor- 
mance in the unrestricted configuration of LFW. 

2. Related Work 

We limit the discussion of related work to supervised 
metric learning, where supervision is induced by a set of 
similar/dissimilar constraints between instances. For weakly 
supervised metric learning, Xing et al. [23] formulate met- 
ric learning as a constrained convex programming problem 
that minimizes distance between within-class instances with 
the constraint that between-class instances are apart from 
each other by a certain distance. Relevant Components Anal- 
ysis (RCA) [2] learns a global linear transformation from 
the similar constraints, which is improved in Discriminative
Component Analysis (DCA) [12] by exploiting dissimilar
constraints; both are extensions of Linear Discriminant
Analysis (LDA) [4]. Information-Theoretic Metric Learning
(ITML) [3] formulates metric learning as a Bregman
optimization problem by minimizing the LogDet divergence
between the learned metric and the prior metric subject to
similar/dissimilar constraints. As for strongly supervised
cases, Neighborhood Component Analysis (NCA) [8] learns a
distance metric by extending the nearest neighbor classifi- 
er, and the Large-Margin Nearest Neighbor (LMNN) [22] 
fits this idea into a maximum margin framework. Metric 
Learning by Collapsing Classes (MCML) [7] tries to find 
a Mahalanobis distance that can collapse within-class
instances onto a single point. A much more comprehensive
survey of metric learning methods can be found in [24].

Of all previous work on metric learning, [18] and [15]
share a sparsification idea relatively similar to our proposed
method. They both derive a LogDet divergence objective
function with element-wise $\ell_1$ regularization from ITML [3];
the former solves the dual problem by a block coordinate
descent algorithm [6], while the latter directly addresses the
primal problem by an alternating linearization method [9].
However, as each feature corresponds to a row and a column in
the Mahalanobis distance matrix, such element-wise
regularization is not suitable for sparsifying feature
selections, which prefer matrices with entire rows and columns
empty; whereas straightforwardly applying group lasso with
row-wise and column-wise $\ell_1$ regularization is usually very
expensive in a high-dimensional feature space. Instead, the
first step of our approach progressively constructs a small
collection of effective features during the metric learning
procedure, which can be considered an analogy of AdaBoost [5]
or matching pursuit [16] with group $\ell_0$ regularization.

3. Supervised Ensemble Metric Learning 

We start this section with symbols and notions. 

• An instance is represented by $K$ feature groups as

$$x = [x^{(1)}, x^{(2)}, \cdots, x^{(K)}]^T \in \mathbb{R}^D, \quad x^{(k)} \in \mathbb{R}^d,$$

where $x^{(k)}$ is the $k$-th feature group with $d$ features and
the concatenated feature dimensionality is $D = Kd$.

• A squared Mahalanobis distance metric is

$$d^A_{ij} = (x_i - x_j)^T A (x_i - x_j), \quad \forall x_i, x_j \in \mathbb{R}^D, \; A \succeq 0,$$

where $A$ is a Mahalanobis matrix.

• $\mathcal{B} \subset \mathbb{R}^{D \times D}$ is the block matrix space in which matrices
consist of $K \times K$ blocks, each of size $d \times d$.

• $\mathcal{B}_{kl}$ is the sparse block matrix space where only the
block in the $k$-th row and the $l$-th column is non-zero.

• $\lfloor A \rfloor_{kl}$ is the projection of matrix $A$ onto space $\mathcal{B}_{kl}$.

• $\|A\|_F$, $\mathrm{tr}(A)$ and $r(A)$ are the Frobenius norm, trace norm
and rank of $A$.

• $\|A\|_{s0} = \mathrm{card}\{k \mid \exists l: \lfloor A \rfloor_{kl} \neq 0 \vee \lfloor A \rfloor_{lk} \neq 0\}$ is the
number of feature groups used by $A$, i.e., our defined
structural $\ell_0$ norm of $A$.

• $\Pi_{PSD}(A)$ is the projection of $A$ onto the Positive
Semi-Definite space; $\Pi_v(A)$ projects the eigenvalues of $A$
onto a simplex to make its trace norm lower than $v$.

• $\mathcal{X}$ is a training set with class labels for every sample.

• $x_i \sim x_j$ or $\pi_{ij} = +1$ denotes that $x_i$ and $x_j$ are of the same
category; $x_i \nsim x_k$ or $\pi_{ik} = -1$ denotes that they are of
different categories.

• $N = |\mathcal{X}|$, $N_i^+ = |\{x_j \mid x_j \sim x_i, x_j \in \mathcal{X}\}|$ and
$N_i^- = |\{x_k \mid x_k \nsim x_i, x_k \in \mathcal{X}\}|$ are the total number
of training samples, the number of within-class samples and the
number of between-class samples with respect to $x_i$.
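
To make the notation concrete, the following minimal Python sketch evaluates the squared Mahalanobis distance and the structural $\ell_0$ norm for small dense matrices stored as nested lists. The function names `sq_mahalanobis` and `structural_l0` and the tolerance parameter `tol` are illustrative assumptions and are not taken from the original implementation.

```python
def sq_mahalanobis(x_i, x_j, a):
    """Squared Mahalanobis distance (x_i - x_j)^T A (x_i - x_j)."""
    diff = [p - q for p, q in zip(x_i, x_j)]
    n = len(diff)
    return sum(diff[r] * a[r][c] * diff[c] for r in range(n) for c in range(n))


def structural_l0(a, d, tol=0.0):
    """Count feature groups whose block row or block column of A is non-zero.

    `a` is a D x D matrix, `d` is the size of one feature group (D = K * d).
    """
    big_d = len(a)
    used = 0
    for k in range(big_d // d):
        rows = range(k * d, (k + 1) * d)
        active = any(
            abs(a[r][c]) > tol or abs(a[c][r]) > tol
            for r in rows
            for c in range(big_d)
        )
        used += int(active)
    return used


# Two feature groups of size 2; only the first diagonal block is active.
A = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(sq_mahalanobis([1.0, 1.0, 5.0, 5.0], [0.0, 0.0, 0.0, 0.0], A))
print(structural_l0(A, d=2))
```

In this toy case the second feature group does not influence the distance at all, which is the property the structural $\ell_0$ constraint is meant to exploit at testing time.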

3.1. Problem Formulation 

In this paper, we discuss the situation that instances are 
represented by a large collection of fixed-size feature groups, 
without loss of generality to cases with varying-size feature 
groups. These feature groups could be subspaces of raw fea- 
tures, or wavelet descriptors at different positions and scales 
such as SIFT and LBP features. As there is huge redundancy 
in an overcomplete representation, a desired metric should 
avoid using feature groups with little discriminability so as 
to estimate similarities between instances efficiently without 
sacrifice in accuracy. For this purpose, we formulate the 
metric learning problem as 



$$\min_{A} \; f(A \mid \mathcal{X}) = \frac{\lambda}{2}\|A\|_F^2 + \ell(A \mid \mathcal{X}) \qquad (1)$$

$$\text{subject to} \quad A \succeq 0, \quad \|A\|_{s0} \le \mu, \quad \mathrm{tr}(A) \le v,$$

in which $\ell(A \mid \mathcal{X})$ is the empirical loss function regarding the
discriminability of $A$ on the training set $\mathcal{X}$. The regularization
term penalizes matrix $A$ by its squared Frobenius norm with
coefficient $\lambda$ for better generalization ability; $A \succeq 0$
keeps the learned metric satisfying the triangle inequality;
$\mathrm{tr}(A) \le v$ yields a low-rank matrix $A$ so that every
instance can eventually be represented in a low-dimensional
space; and in particular, $\|A\|_{s0} \le \mu$ imposes group
sparsity on matrix $A$ to ensure that only a limited number of
feature groups (at most $\mu$) will actually be involved in
the testing phase.

3.2. A Two-Step Approach 

However, the optimization task in Equation 1 is NP-hard
due to the structural $\ell_0$ norm and thus extremely difficult to
solve with high-dimensional overcomplete representations
of data. We propose a two-step metric learning approach
consisting of sparse block diagonal metric ensembling and
joint metric learning to address this problem; the two steps
are elaborated in Algorithm 1 and Algorithm 2 respectively.

Sparse block diagonal metric ensembling starts from an
empty set of feature groups ($A = 0$), progressively chooses
effective feature groups (indicated by $\kappa$), learns weak metrics
($A^*_\kappa$) and combines them into a strong one. Every candidate
feature group is evaluated by the partial derivative of the loss
function $f(\cdot)$ w.r.t. its corresponding diagonal block in matrix
$A$. The negative of this partial derivative matrix is projected
onto the Positive Semi-Definite space so that it decreases
the loss function while keeping the updated matrix Positive
Semi-Definite. The algorithm selects the diagonal block with
the largest $\ell_2$ norm of its projected partial derivative matrix,
and optimizes it as well as a scale factor $\alpha$ adjusting the


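Algorithm 1 Sparse block diagonal metric ensembling
INPUT: $\mathcal{X}$, $\mu$ and $\lambda$.
$A \leftarrow 0$
for $t = 1$ to $\mu$ do
    $\kappa = \arg\max_{k \in \{1, 2, \cdots, K\}} \left\| \Pi_{PSD}\left( -\left\lfloor \frac{\partial f(A \mid \mathcal{X})}{\partial A} \right\rfloor_{kk} \right) \right\|_2$
    $(A^*_\kappa, \alpha^*) = \arg\min_{A_\kappa \succeq 0,\; A_\kappa \in \mathcal{B}_{\kappa\kappa},\; \alpha \in \mathbb{R}^+} f(\alpha A + A_\kappa \mid \mathcal{X})$
    $A \leftarrow \alpha^* A + A^*_\kappa$
end for
$A^\dagger = A$, $L^\dagger = U$ where $U \Lambda U^T = A^\dagger$, $U \in \mathbb{R}^{D \times D^\dagger}$
OUTPUT: $A^\dagger$ and $L^\dagger$.

Algorithm 2 Joint metric learning
INPUT: $\mathcal{X}$, $v$, $\lambda$ and $L^\dagger$.
Dimension reduction: $\mathcal{X}^\dagger = \{ L^{\dagger T} x \mid x \in \mathcal{X} \}$.
$A \leftarrow 0$.
while not converged do
    $\nabla f(A \mid \mathcal{X}^\dagger) = \partial f(A \mid \mathcal{X}^\dagger) / \partial A$
    Choose a proper step $\gamma$.
    $A \leftarrow \Pi_v(A - \gamma \nabla f(A \mid \mathcal{X}^\dagger))$.
end while
$L^\ddagger = L^\dagger L$ where $L L^T = A$.
$A^\ddagger = L^\ddagger L^{\ddagger T}$.
OUTPUT: $A^\ddagger$ and $L^\ddagger$.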



previously learned matrix to minimize the loss function. After
$\mu$ rounds of this weak metric learning procedure, we
obtain a sparse block diagonal matrix, $A^\dagger$, with at
most $\mu$ feature groups activated. Through the eigenvalue
decomposition, its orthogonal linear transformation matrix
$L^\dagger$ preliminarily reduces the feature dimensionality from $D$
to $D^\dagger$ ($D^\dagger = r(A^\dagger) \ll D$).

Owing to the supervised dimension reduction achieved
by sparse block diagonal metric ensembling, joint metric
learning is capable of further exploiting correlations between
the selected feature groups in the intermediate feature
space $\mathcal{X}^\dagger$ without diagonal block constraints. The projected
gradient descent method is adopted to solve this optimization
problem: the Mahalanobis matrix is iteratively updated
by its gradient with a proper step size, and then regulated
by projecting its eigenvalues onto a simplex to satisfy
$\mathrm{tr}(A) \le v$ and $A \succeq 0$. In this way, a secondary linear
transformation matrix $L$ can be learned to map instances onto
an even lower-dimensional space, and the final linear
transformation matrix $L^\ddagger = L^\dagger L$ represents all instances
in a $D^\ddagger$-dimensional space, where Euclidean distance is the
optimal metric for similarity measurement. In other words,
$A^\ddagger = L^\ddagger L^{\ddagger T}$ is the final Mahalanobis matrix.
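
The eigenvalue projection used in the joint metric learning step can be sketched as follows. The text does not state which projection algorithm is used, so this sketch assumes a Euclidean projection onto the set of non-negative eigenvalues with sum at most $v$, computed with the standard sort-based simplex projection; the function name `project_eigenvalues` is hypothetical, and the eigenvalue decomposition itself is assumed to be done by a linear algebra library.

```python
def project_eigenvalues(eigs, v):
    """Euclidean projection of eigenvalues onto {lam >= 0, sum(lam) <= v}.

    Assumes v > 0. With v = float("inf") this reduces to clipping
    negative eigenvalues, i.e., a plain PSD projection.
    """
    clipped = [max(e, 0.0) for e in eigs]
    if sum(clipped) <= v:
        return clipped
    # The trace constraint is active: project onto {lam >= 0, sum(lam) == v}.
    ordered = sorted(eigs, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        shift = (cumulative - v) / j
        if value - shift > 0:
            tau = shift  # keeps the shift of the last index that passes
    return [max(e - tau, 0.0) for e in eigs]


print(project_eigenvalues([3.0, 1.0, -2.0], v=2.0))
print(project_eigenvalues([0.5, -1.0], v=float("inf")))
```

Small eigenvalues are driven exactly to zero when the trace constraint is active, which is how this kind of projection lowers the rank of the metric.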

The key component of this metric learning approach is
the computation of the empirical loss function $\ell(A \mid \mathcal{X})$ and its
gradient, which are defined by constraints between instances.
From training data with explicit class labels, two types of
constraints can be generated: pairwise and triplet. For
example, let $x_i$ and $x_j$ be two instances of the same category
and $x_k$ be an instance of another category. From the
viewpoint of $x_i$, on the one hand, pairwise constraints are
$d^A_{ij} < \theta$ and $d^A_{ik} > \theta$, where $\theta$ is a general threshold
separating all similar pairs from dissimilar ones; constraints of
this type are adopted in verification problems that determine
whether a pair of instances belong to the same category or not. On
the other hand, the triplet constraint is $d^A_{ij} < d^A_{ik}$, which
is clearly a ranking preference designed for clustering or
retrieval tasks that are only concerned with the relative
difference of distances between instances. We discuss the two
constraint types in the following subsections.

3.3. Pairwise Constraints 

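The empirical error of $A$ with threshold $\theta$ on all pairwise
constraints from $\mathcal{X}$ is defined by

$$\epsilon_\theta(A \mid \mathcal{X}) = \Pr\big(\pi_{ij}(d^A_{ij} - \theta) > 0 \mid x_i, x_j \in \mathcal{X}\big) = E_{x_i, x_j} \mathbb{1}_{\pi_{ij}(d^A_{ij} - \theta) > 0}, \qquad (2)$$

in which $\pi_{ij} = \pm 1$ indicates whether $x_i$ and $x_j$ belong to
the same category or not, and $\mathbb{1}_{(\cdot)}$ is a characteristic function
that outputs 1 if $(\cdot)$ is satisfied and 0 otherwise. By replacing
$\mathbb{1}_{z > 0}$ with the exponential-based logit surrogate function
$\psi_\beta(e^z) = \frac{\ln(1 + \beta e^z)}{\ln(1 + \beta)}$ and setting $\beta = 1$, we obtain an upper
bound of this empirical error as

$$\epsilon_\theta(A \mid \mathcal{X}) \le E_{x_i, x_j} \psi_\beta\big(e^{\pi_{ij}(d^A_{ij} - \theta)}\big) = \ell_\theta(A \mid \mathcal{X}), \qquad (3)$$

which is smooth and convex, serving as the empirical loss
function with pairwise constraints.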
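
As an illustration of Equations 2 and 3, the sketch below computes the empirical error and its logit surrogate bound for a handful of pairs. It assumes the pairs are given as `(distance, pi)` tuples with `pi` equal to +1 or -1; the helper names `psi_of_exp` and `pairwise_error_and_loss` are hypothetical. The naive `math.exp` call can overflow for very large arguments, which a production implementation would need to guard against.

```python
import math


def psi_of_exp(z, beta=1.0):
    """Logit surrogate psi_beta(e^z) = ln(1 + beta * e^z) / ln(1 + beta)."""
    return math.log1p(beta * math.exp(z)) / math.log1p(beta)


def pairwise_error_and_loss(pairs, theta, beta=1.0):
    """Return (empirical error, surrogate loss) averaged over the given pairs.

    Each pair is (d, pi): a squared Mahalanobis distance and a label
    pi = +1 (same category) or pi = -1 (different categories).
    """
    margins = [pi * (d - theta) for d, pi in pairs]
    error = sum(1.0 for z in margins if z > 0) / len(margins)
    loss = sum(psi_of_exp(z, beta) for z in margins) / len(margins)
    return error, loss


# Two same-class pairs and two different-class pairs, threshold theta = 14.
pairs = [(10.0, +1), (20.0, +1), (30.0, -1), (5.0, -1)]
error, loss = pairwise_error_and_loss(pairs, theta=14.0)
print(error, loss)
```

Because $\psi_\beta(e^z) \ge 1$ whenever $z > 0$ and $\psi_\beta(e^z) > 0$ otherwise, the surrogate loss is never smaller than the empirical error.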



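Let $\eta_{ij} = d^A_{ij} - \theta$. By the chain rule, we have

$$\frac{\partial \ell_\theta(A \mid \mathcal{X})}{\partial A} = \sum_{i,j} \frac{\partial \ell_\theta(A \mid \mathcal{X})}{\partial \eta_{ij}} \frac{\partial \eta_{ij}}{\partial A} = \sum_{i,j} w_{ij} (x_i - x_j)(x_i - x_j)^T, \qquad (4)$$

in which the weight term is

$$w_{ij} = \frac{\partial \ell_\theta(A \mid \mathcal{X})}{\partial \eta_{ij}} = \frac{1}{N^2 \ln 2} \cdot \frac{\pi_{ij}}{1 + e^{-\pi_{ij}(d^A_{ij} - \theta)}}. \qquad (5)$$

Given the weight matrix $W = \{w_{ij}\}_{N \times N}$, Equation 4 can
be efficiently computed by

$$\frac{\partial \ell_\theta(A \mid \mathcal{X})}{\partial A} = X (S - W - W^T) X^T, \qquad (6)$$

where $X = [x_1, x_2, \cdots, x_N]$ is the feature matrix of $\mathcal{X}$ and
$S = \mathrm{diag}\big(\sum_k (w_{1k} + w_{k1}), \cdots, \sum_k (w_{Nk} + w_{kN})\big)$.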
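
The equivalence between the pairwise sum in Equation 4 and the matrix form in Equation 6 can be checked numerically. The sketch below uses plain nested lists and an arbitrary weight matrix; the function names are illustrative, and a practical implementation would use an optimized linear algebra library instead of explicit loops.

```python
def outer_sum_gradient(xs, w):
    """Equation 4: sum over i, j of w_ij (x_i - x_j)(x_i - x_j)^T."""
    n, dim = len(xs), len(xs[0])
    grad = [[0.0] * dim for _ in range(dim)]
    for i in range(n):
        for j in range(n):
            diff = [xs[i][a] - xs[j][a] for a in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    grad[a][b] += w[i][j] * diff[a] * diff[b]
    return grad


def matrix_form_gradient(xs, w):
    """Equation 6: X (S - W - W^T) X^T, with samples x_i as the columns of X."""
    n, dim = len(xs), len(xs[0])
    # m = S - W - W^T, where S_ii = sum_k (w_ik + w_ki).
    m = [[-(w[i][j] + w[j][i]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] += sum(w[i][k] + w[k][i] for k in range(n))
    return [
        [
            sum(xs[i][a] * m[i][j] * xs[j][b] for i in range(n) for j in range(n))
            for b in range(dim)
        ]
        for a in range(dim)
    ]


xs = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
w = [[0.0, 0.5, -0.2], [0.1, 0.0, 0.3], [-0.4, 0.2, 0.0]]
print(outer_sum_gradient(xs, w))
print(matrix_form_gradient(xs, w))
```

The matrix form replaces $N^2$ outer products with a few matrix multiplications, which is what makes batch processing of all pairwise constraints practical.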



3.4. Triplet Constraints 

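The empirical error of $A$ on all triplet constraints from
$\mathcal{X}$ is defined by

$$\epsilon(A \mid \mathcal{X}) = \Pr\big(d^A_{ij} > d^A_{ik} \mid x_j \sim x_i, x_k \nsim x_i\big) = E_{x_i,\, x_j \sim x_i,\, x_k \nsim x_i} \mathbb{1}_{d^A_{ij} > d^A_{ik}}. \qquad (7)$$

Similarly, we have an upper bound of this empirical error as

$$\epsilon(A \mid \mathcal{X}) \le E_{x_i,\, x_j \sim x_i,\, x_k \nsim x_i} \psi_\beta\big(e^{d^A_{ij} - d^A_{ik}}\big) = \hat{\ell}(A \mid \mathcal{X}). \qquad (8)$$

However, this is not an appropriate loss function, as the
computational complexity given $\{d^A_{ij} \mid \forall i, j\}$ could be $O(N^3)$. By
using the concavity of $\psi_\beta$, we further relax it to be

$$\hat{\ell}(A \mid \mathcal{X}) \le E_{x_i} \psi_\beta\big(E_{x_j \sim x_i,\, x_k \nsim x_i} e^{d^A_{ij} - d^A_{ik}}\big) = E_{x_i} \psi_\beta\big(E_{x_j \sim x_i} e^{d^A_{ij}} \cdot E_{x_k \nsim x_i} e^{-d^A_{ik}}\big) = E_{x_i} \psi_\beta(\phi_i^+ \phi_i^-) = \ell(A \mid \mathcal{X}), \qquad (9)$$

which is still smooth and convex, where

$$\phi_i^+ = \frac{1}{N_i^+} \sum_{x_j \sim x_i} e^{d^A_{ij}}, \qquad \phi_i^- = \frac{1}{N_i^-} \sum_{x_k \nsim x_i} e^{-d^A_{ik}}. \qquad (10)$$

This is a loss function that upper-bounds the empirical
error with all triplet constraints generated from $\mathcal{X}$, and its
computational complexity given $\{d^A_{ij} \mid \forall i, j\}$ is just $O(N^2)$,
the same as that with pairwise constraints in Equation 3.
Similar to Equations 4 and 5, we have

$$\frac{\partial \ell(A \mid \mathcal{X})}{\partial A} = \sum_{i,j} w_{ij} (x_i - x_j)(x_i - x_j)^T, \qquad (11)$$

$$w_{ij} = \begin{cases} \dfrac{\beta \phi_i^- \exp(d^A_{ij})}{N N_i^+ \ln(1 + \beta)(1 + \beta \phi_i^+ \phi_i^-)} & \text{if } x_j \sim x_i, \\[2ex] \dfrac{-\beta \phi_i^+ \exp(-d^A_{ij})}{N N_i^- \ln(1 + \beta)(1 + \beta \phi_i^+ \phi_i^-)} & \text{if } x_j \nsim x_i. \end{cases}$$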
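
The effect of the relaxation from Equation 8 to Equation 9 can be illustrated with a small numerical sketch. It assumes a precomputed matrix `d` of squared distances and integer class labels, treats the expectations as nested uniform averages (over $x_i$, then over its within-class and between-class samples), and excludes $x_i$ itself from its own within-class set; these conventions and the function names are assumptions made for the example.

```python
import math


def psi(x, beta):
    """psi_beta(x) = ln(1 + beta * x) / ln(1 + beta), concave in x."""
    return math.log1p(beta * x) / math.log1p(beta)


def triplet_losses(d, labels, beta=1.0):
    """Return (triple-sum bound of Eq. 8, relaxed loss of Eq. 9).

    Every class is assumed to contain at least two samples, and at
    least two classes must be present.
    """
    n = len(labels)
    cubic, relaxed = 0.0, 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [k for k in range(n) if labels[k] != labels[i]]
        # Equation 8: one term per (j, k) pair, O(N^3) over all i.
        cubic += sum(
            psi(math.exp(d[i][j] - d[i][k]), beta) for j in pos for k in neg
        ) / (len(pos) * len(neg))
        # Equations 9 and 10: two separate sums, O(N^2) over all i.
        phi_pos = sum(math.exp(d[i][j]) for j in pos) / len(pos)
        phi_neg = sum(math.exp(-d[i][k]) for k in neg) / len(neg)
        relaxed += psi(phi_pos * phi_neg, beta)
    return cubic / n, relaxed / n


d = [
    [0.0, 1.0, 3.0, 4.0],
    [1.0, 0.0, 2.0, 5.0],
    [3.0, 2.0, 0.0, 1.5],
    [4.0, 5.0, 1.5, 0.0],
]
cubic, relaxed = triplet_losses(d, labels=[0, 0, 1, 1])
print(cubic, relaxed)
```

By Jensen's inequality the relaxed value is never smaller than the triple-sum bound, so it remains an upper bound of the empirical error while needing only two sums per instance.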



3.5. Computational Complexity 

Let $T$ be the number of iterations needed for projected
gradient descent. For both pairwise constraints and triplet
constraints, sparse block diagonal metric ensembling costs
$O(ND + N^2)$ in memory and $O(\mu(K + T)(N^2 d + N d^2 + d^3))$
in time, which scales well to a larger $K$, the number
of feature groups. Joint metric learning costs $O(N D^\dagger)$ in
memory and $O(T(N^2 D^\dagger + N D^{\dagger 2} + D^{\dagger 3}))$ in time, so it can
afford more training data than sparse block diagonal
metric ensembling, as $D^\dagger \ll D = Kd$.

4. Experiments

We implement the proposed ensemble metric learning
method in Python and C++ for face verification and face
retrieval. To preprocess the data, every face image is aligned
by its eyes and mouth, and then cropped to keep its central
110 x 150 region; a Gaussian smoothing filter with a 1-pixel
kernel size is applied afterward to suppress white noise.

Two types of features are employed in our experiments: 
covariance matrix descriptors (CMD) [21] and soft local 
binary pattern histograms (SLBP) [1]. Covariance matrix 
descriptors represent image regions by the covariance matrix 
of basic features such as spatial location, intensity, higher 
order derivatives, etc. We follow the work in [21] where 
every covariance matrix descriptor is a 45-dimensional vec- 
tor. SLBP features are derived from local binary pattern 
histograms (LBP) by considering probabilities of a patch 
being different patterns, so each SLBP descriptor is a 59- 
dimensional vector. In our experiments, 13,260 rectangular 
regions of size varying from 8 x 8 to 96 x 144 are enumerated 
within the 110 x 150 region, covering the face image ho- 
mogeneously. Each rectangular region defines a covariance 
matrix descriptor and a SLBP descriptor; every descriptor 
is whitened on training data by PCA; top 45 dimensions of 
SLBP descriptors are preserved. Eventually, each face image 
is represented by 13,260 covariance matrix descriptors and 
13,260 SLBP descriptors, each of 45 dimensions constituting 
a feature group. Neither type of feature uses the color
information of the images. The flowchart of feature extraction
and dimensionality reduction is illustrated in Figure 1.

4.1. Face Verification 

The goal of face verification is to determine whether a 
pair of testing face images belong to the same person or not, 
which fits into the pairwise constraint formulation in Section 
3.3. Our face verification experiment is carried out on the 
LFW data set [13], which is a very challenging data set
with 13,233 face images of 5,749 persons collected from the 
web. It has two views for its own data. View 1 partitions 
the 5,749 persons into training and testing set for develop- 
ment purposes such as model selection and parameter tuning. 
View 2 divides them into ten disjoint folds and specifies 300 
"positive" pairs and 300 "negative" pairs within each fold: 
positive pairs are instances of the same person while negative 
pairs are those of different persons. It is suggested in [13] 
to conduct a ten-fold cross validation experiment, in which 
models are required to be learned on nine folds and tested 
on the 600 pairs of the remaining fold in terms of classifi- 
cation accuracy. We follow the "unrestricted configuration" 
[13] that allows us to use the identity information of training 
data and choose the "aligned" version [20] of LFW for fair 
comparison to previous work on this data set. 

Three different metrics are learned for testing on each 
fold: a CMD based one, a SLBP based one and a combined 



[Figure 1 diagram: aligned and cropped image, overcomplete features, sparse features and low-dimensional representation, shown for the training phase and the testing phase.]

Figure 1. Flowchart of ensemble metric learning. The learning 
phase needs to generate overcomplete representations of face im- 
ages while the testing phase only needs to extract a sparse collec- 
tion of them according to the result of sparse block diagonal metric 
ensembling. Face images are finally coded by low-dimensional 
vectors and the Euclidean distance measures their similarities. 





                        CMD              SLBP             CMD+SLBP
$D$                     596,700          596,700          -
$r(A^\dagger)$          1627 ± 53        1507 ± 34        -
$\epsilon(A^\dagger)$   0.116 ± 0.013    0.145 ± 0.02     -
$r(A^\ddagger)$         1312 ± 15        1179 ± 10        2302 ± 21
$\epsilon(A^\ddagger)$  0.083 ± 0.011    0.100 ± 0.013    0.074 ± 0.014

Table 1. Performance of the two steps of our approach on the
two types of features and the combined one. Note the significant
reduction in rank $r(A^\dagger)$ and testing error $\epsilon(A^\dagger)$ of metrics learned
by sparse block diagonal metric ensembling. Also note the further
reduction in $r(A^\ddagger)$ and improvement in $\epsilon(A^\ddagger)$, enabled by joint
metric learning.



one by applying joint metric learning on the concatenated
intermediate features (i.e., $\mathcal{X}^\dagger$ in Algorithm 2) of both types. We
set $\mu = 400$ so that at most 400 feature groups are selected;
$v = \infty$, as representing face images in a low-dimensional
space is not crucial for face verification; $\theta = 14$ and $\lambda = 1$
according to experiments on View 1. To alleviate the difficulty
due to viewpoint changes, we define the flip-free distance
between $x_i$ and $x_j$ by

$$\bar{d}^A(x_i, x_j) = 0.25 \big( d^A(x_i, x_j) + d^A(x_i', x_j) + d^A(x_i, x_j') + d^A(x_i', x_j') \big),$$

where $x_i'$ and $x_j'$ are horizontally flipped images. Whether
$x_i$ and $x_j$ are of the same person is determined by this
distance measurement with a threshold chosen on View 1.
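
The flip-free distance can be sketched as a plain average of four squared Mahalanobis distances. The example assumes that the feature vectors of the flipped images have already been extracted and are passed in explicitly; the function names and the toy 2-dimensional features are illustrative only.

```python
def sq_mahalanobis(x, y, a):
    """Squared Mahalanobis distance (x - y)^T A (x - y)."""
    diff = [p - q for p, q in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * a[r][c] * diff[c] for r in range(n) for c in range(n))


def flip_free_distance(x_i, x_i_flip, x_j, x_j_flip, a):
    """Average distance over the four original/flipped combinations."""
    return 0.25 * (
        sq_mahalanobis(x_i, x_j, a)
        + sq_mahalanobis(x_i_flip, x_j, a)
        + sq_mahalanobis(x_i, x_j_flip, a)
        + sq_mahalanobis(x_i_flip, x_j_flip, a)
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
# Toy features in which flipping swaps the two coordinates.
dist = flip_free_distance([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], identity)
is_same_person = dist < 14.0  # threshold value used here only for illustration
print(dist, is_same_person)
```

The resulting distance is symmetric in the two images and unchanged when either image is replaced by its mirrored version.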

As shown in Table 1, from $D$ to $r(A^\dagger)$ (i.e., $D^\dagger$), the
dimensionality of overcomplete representations is dramatically
reduced by sparse block diagonal metric ensembling, since
only 400 features are adopted among the 13,260 ones and
their corresponding diagonal blocks are not full rank due to
the positive semi-definite constraint. From $r(A^\dagger)$ to $r(A^\ddagger)$,
joint metric learning further decreases the dimensionality
needed to represent each instance, even without the low-rank
constraint ($v = \infty$ in this experiment), while reducing the
testing error remarkably from $\epsilon(A^\dagger)$ to $\epsilon(A^\ddagger)$. Metrics based
on CMD are overall superior to those based on SLBP, while
better metrics are learned based on concatenated intermediate
features of both types.

Method                                     mean ± std
LDML-MkNN, funneled [11]                   87.50 ± 0.40%
LBP multishot, aligned [20]                85.17 ± 0.61%
combined multishot, aligned [20]           89.50 ± 0.51%
LBP PLDA, aligned [14]                     87.33 ± 0.55%
combined PLDA, funneled & aligned [14]     90.07 ± 0.51%
face.com r2011b [19]                       91.30 ± 0.30%
CMD ensemble metric learning, aligned      91.70 ± 1.10%
SLBP ensemble metric learning, aligned     90.00 ± 1.33%
CMD + SLBP, aligned                        92.58 ± 1.36%

Table 2. Classification accuracy in the unrestricted configuration
on LFW. With either a single type or multiple types of features, our
proposed approach outperforms previous methods.

Table 2 compares our ensemble metric learning approach
to existing state-of-the-art methods in the same unrestricted
configuration on LFW, including LDML-MkNN [11], which
combines Logistic Discriminant based Metric Learning and
Marginalized k-NN classifiers; multishot [20], which uses
multiple one-shot similarity scores based on Information
Theoretic Metric Learning [3]; PLDA [14], which adopts
Probabilistic Linear Discriminant Analysis instead of
distance-based methods; and face.com [19], a commercial system
without any details published. With a single type of features,
the CMD based metrics achieve 91.70% and the SLBP based ones
achieve 90.00% in average accuracy, both outperforming the
best of previously published work (87.33% by LBP PLDA
[14]); with multiple types of features, the ensemble metric
learning method further improves the average accuracy to
92.58%, which is even better than the result of face.com [19],
which is equipped with an accurate 3D face model to overcome
pose and illumination variations. ROC curves of methods
with multiple types of features are shown in Figure 2.

A data-driven method like ours does not require heuristics 
such as the face structure. Hence, our approach can easily 
generalize to metric learning problems on object types other 
than the face. However, it also has the drawback of being
sensitive to the choice of training data when only a limited
amount is available. This explains the larger standard
deviation of our approach's classification accuracy compared
to the other methods in Table 2. This drawback can be
alleviated by using more training data.

[Figure 2 plot: ROC curves over false positive rates from 0.0 to 0.5 for LDML-MkNN, funneled [11]; combined multishot, aligned [20]; combined PLDA, funneled & aligned [14]; face.com r2011b [19]; and CMD + SLBP ensemble metric learning, aligned.]

Figure 2. ROC curves of state-of-the-art methods using multiple
types of features in the unrestricted configuration of LFW.

Our method takes about one day to train a metric for one 
fold on five Intel 2.2GHz Xeon servers. The learned metric is 
of high computational efficiency owing to the group sparsity 
of features. In practice, it takes less than 50 ms to obtain the 
final vector representation of a face image on one server, and 
less than 1 ms to compute the Euclidean distance between 
two such vectors. 

4.2. Face Retrieval 

Unlike face verification, face retrieval aims at searching 
a reference database for registered faces that are similar to 
a given query. The triplet constraint formulation in Section
3.4 is adopted since only relative preference is considered
in this situation. Three data sets are involved in this face
retrieval experiment: LFWext, subLFWext and Qry60. LFWext
includes 171,509 images of 4,740 persons, collected from the
internet based on the name list of LFW; subLFWext is a subset
of LFWext containing 1,275 people, each having exactly 40 
images; Qry60 is a different set consisting of 2,040 images 
from 60 people outside the name list of LFW. Metrics are 
learned on subLFWext or LFWext and tested on Qry60. Only 
covariance matrix descriptor features are employed in this 
experiment for simplicity. 

We study the proposed joint metric learning method with 
two benchmarks: LDA [4] and LMNN [22]. Their perfor- 
mance is evaluated by leave-one-out 3-NN classification 
error on Qry60, mean average precision of retrieving the 
same person with each of its images inside Qry60, and mean 
average precision by injecting Qry60 into LFWext and
retrieving them back. We denote these three evaluation metrics
by $\epsilon_Q$, $\mathrm{mAP}_Q$ and $\mathrm{mAP}_{QL}$ hereafter.
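
For reference, mean average precision can be computed as in the sketch below. The text does not spell out the exact variant used, so this assumes the standard definition, in which each query yields a ranked list of booleans marking whether the retrieved image shows the same person; the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank over the ranks of the relevant results."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two queries: relevant results at ranks 1 and 3, and at rank 2.
print(mean_average_precision([[True, False, True], [False, True]]))
```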

As the first step, an intermediate metric $A^\dagger$ is learned
by sparse block diagonal metric ensembling on subLFWext,
with parameter settings $\mu = 400$, $\beta = 10^6$ and $\lambda = 1$.
The rank of $A^\dagger$ is 2513, so all face images are represented in
a 2513-dimensional intermediate feature space.

$k$     $\epsilon_Q$    $\mathrm{mAP}_Q$    $\mathrm{mAP}_{QL}$    $r(A^\ddagger)$
5       0.0696          0.6563              0.3321                 1033
10      0.0588          0.6926              0.3801                 816
15      0.0578          0.7050              0.3975                 714
20      0.0554          0.7125              0.4078                 649
25      0.0544          0.7175              0.4134                 602
30      0.0544          0.7207              0.4161                 572
35      0.0539          0.7209              0.4155                 561
39      0.0583          0.7093              0.3971                 594

Table 3. Performance of joint metric learning with different $k$ ($k \le 39$
as every person has exactly 40 face images in subLFWext).

4.2.1 Target neighbor number k 

The concept of target neighbors is introduced by LMNN 
[22] specifically for k-NN classification problems. Given
an instance, among all instances of the same category, only 
the top k closest ones are adopted to generate triplet con- 
straints for metric learning. It helps relax metric learning 
by eliminating too difficult triplet constraints that contain 
dissimilar but within-class instance pairs. Our approach can 
easily accommodate this concept by ignoring within-class
instances that are not among the top $k$ closest ones to $x_i$ when
computing $\phi_i^+$ in Equation 10. We apply joint metric
learning with different target neighbor numbers on subLFWext,
and compare their results in terms of classification error, 
retrieval accuracy and the rank of final metrics. As shown 
in Table 3, optimal choices of k are either 30 or 35. On the 
one hand, targeting more neighbors can enforce more triplet 
constraints to make the learned metric generalize better to 
unknown testing categories like the 60 persons in Qry60. On 
the other hand, ignoring a few too difficult within-class pairs 
actually alleviates the risk of overfitting since they are very 
likely to be just outliers or even mislabeled data. 
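
Restricting $\phi_i^+$ to target neighbors can be sketched as follows. The example assumes a precomputed distance matrix `d` is supplied by the caller (the text does not state under which metric the closest neighbors are ranked), and the function names are illustrative.

```python
import math


def target_neighbors(d, labels, i, k):
    """Indices of the k closest within-class samples to sample i."""
    same = [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
    return sorted(same, key=lambda j: d[i][j])[:k]


def phi_pos_with_targets(d, labels, i, k):
    """phi_i^+ of Equation 10, averaged over the target neighbors only."""
    targets = target_neighbors(d, labels, i, k)
    return sum(math.exp(d[i][j]) for j in targets) / len(targets)


# Sample 0 has three within-class samples (1, 2 and 4); keep the closest two.
d = [
    [0.0, 2.0, 1.0, 0.5, 3.0],
    [2.0, 0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0, 1.0],
    [0.5, 1.0, 1.0, 0.0, 1.0],
    [3.0, 1.0, 1.0, 1.0, 0.0],
]
labels = [0, 0, 0, 1, 0]
print(target_neighbors(d, labels, i=0, k=2))
print(phi_pos_with_targets(d, labels, i=0, k=2))
```

Dropping the farthest within-class sample removes the largest exponential term from $\phi_i^+$, which is how overly difficult pairs stop dominating the loss.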

4.2.2 Joint metric learning vs. LMNN 

The joint metric learning method with triplet constraints is 
an analogy to LMNN [22] in the nature of maximizing the 
margin between instances of different categories. The two 
methods differ in loss functions (smooth vs. hinge) and 
regularization terms. We reduce the dimensionality of the 
intermediate feature space to 100, 200, 300, 500 and 1000 by
PCA, and set the target neighbor number $k$ to 5, 10 and 30
respectively to compare joint metric learning to the improved
LMNN implemented by the active set method. For joint
metric learning we use $v = +\infty$, $\lambda = 1$, $\beta = 10^6$. For
LMNN we use default parameters. We still use subLFWext
for training. Experiments are done on a single Intel 2.2GHz
Xeon server.

Figure 3. The left figure compares LMNN and joint metric learning
(JML) in convergence speed of $\epsilon_Q$ with $k = 5$ in 100-dimensional
space; the right one compares LDA and joint metric learning (JML)
in $\mathrm{mAP}_{QL}$ with different final dimensionality.

Figure 4. Comparisons between LMNN and joint metric learning
(JML) in $\epsilon_Q$ and $\mathrm{mAP}_Q$. Note the consistent improvement
achieved by JML using more target neighbors and higher feature
dimensionality.

As shown in the left of Figure 3, the leave-one-out 3-NN
classification error of joint metric learning drops faster than
that of LMNN in the 100-dimensional feature space when
k = 5; the former converges after 71 iterations while the
latter runs for about 1,400 iterations. Owing to the relaxation
of the empirical loss with triplet constraints in Equation 9,
joint metric learning is able to compute the exact gradient
of the loss function efficiently. Though each iteration is
relatively slow, it generally requires far fewer iterations to
converge than the LMNN algorithm implemented
by the active set method. Especially for hard problems where
different categories are not well separated, the active set has
to maintain a huge number of triplet constraints and spends a
significant amount of time updating them. We set the maximum
iteration number to 1000 for LMNN, as it barely
improves after that. According to the comparisons in Figure 4,
joint metric learning consistently benefits from more target
neighbors and higher feature dimensionality, while LMNN
seems unable to take advantage of these changes and even
suffers from them.

From the running times reported in Table 4, LMNN
does not scale well with respect to more target neighbors
and higher-dimensional feature spaces. Metric learning by
LMNN was not even feasible in three cases because the
program exceeded the memory limit (the server has 16 GB of
memory). In contrast, our method can easily accommodate
problems of larger scales and yields steady training times
per iteration with any target neighbor number.

dim     LMNN                      Joint metric learning
        k = 5   k = 10  k = 30    k = 5      k = 10     k = 30
100     2.5     3.5     6.4       1.9(71)    1.6(60)    1.2(45)
200     4.3     5.8     13.8      6.0(190)   2.6(80)    1.9(59)
300     6.9     9.7     28.7      4.9(119)   3.8(99)    2.6(70)
500     15.1    25.1    -         7.1(130)   6.6(119)   4.4(81)
1000    63.7    -       -         10.3(126)  10.5(130)  7.6(91)

Table 4. Running hours of LMNN and joint metric learning. The
numbers of iterations are given in brackets (1000 for LMNN). Three
cases of LMNN (marked "-") failed in our experiment as they
exceeded the memory limit of our server (16 GB).

4.3. Joint metric learning vs. LDA 

In this subsection, we aim at the best practice of metric
learning in the 2513-dimensional intermediate feature
space on the LFWext data set, which contains 171,509 images
of 4,740 persons. Parameter settings are ν = 6, λ = 1
and β = 10^6. It takes about 27 hours to train the metric
on 8 Intel 2.2GHz Xeon servers. Though the rank of the final
metric A_J is 800 owing to the trace norm constraint, it is
still not low enough for face retrieval tasks. We therefore
retain only the d eigenvectors with the largest eigenvalues
(i.e., r(A_J) = d) to represent every instance more economically.
As LMNN cannot be applied to such a large-scale
problem, we compare joint metric learning to LDA in their
performance of retrieving Qry60 from LFWext + Qry60 (i.e.,
mAPql). As shown in the right of Figure 3, joint metric
learning significantly outperforms LDA, which becomes
saturated when r(A_J) > 200. By comparing to the mAPql
score obtained by the metric learned on subLFWext (Table
3), we also observe considerable improvement in retrieval
accuracy by using more training data.
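
Retaining only the d leading eigenvectors of a learned metric can be sketched as below. This is an illustrative sketch under stated assumptions: the function name truncate_metric is hypothetical, the metric is assumed to be a symmetric positive semidefinite matrix A, and a small random matrix stands in for the learned metric.

```python
import numpy as np

def truncate_metric(A, d):
    """Return L (d x D) built from the d largest eigenpairs of A.

    L.T @ L is then a rank-d approximation of A, and the Mahalanobis
    distance (x - y)^T A (x - y) is approximated by ||L x - L y||^2.
    """
    w, V = np.linalg.eigh(A)             # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]        # indices of the d largest eigenvalues
    w_d = np.clip(w[idx], 0.0, None)     # guard against tiny negative values
    return np.sqrt(w_d)[:, None] * V[:, idx].T

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
A = B @ B.T                              # stand-in PSD metric of rank 4
L = truncate_metric(A, d=4)

x, y = rng.standard_normal(6), rng.standard_normal(6)
full = (x - y) @ A @ (x - y)             # distance under the full metric
low = np.sum((L @ x - L @ y) ** 2)       # distance between projected vectors
```

Because the stand-in metric has rank 4, keeping d = 4 eigenvectors reproduces the full distance up to rounding; with d below the rank, the projected distance becomes an approximation, and each instance is stored as a d-dimensional vector L @ x.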

In practice, we keep the top 150 projection vectors to 
represent a face by a 150-dimensional vector, which makes 
retrieval in large databases extremely efficient. On a single 
Intel Xeon 2.2GHz server, it only takes about 2 seconds to
exhaustively explore an unlabeled database with 4 million 
faces and find the most similar ones to a query. 
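
Exhaustive retrieval over compact vectors of this kind can be sketched as a brute-force scan. This is a minimal illustration rather than the system used in the experiments: the function name retrieve is hypothetical, the database is random stand-in data at a much smaller scale, and float32 storage is an assumption (under which 4 million 150-dimensional vectors would occupy about 2.4 GB).

```python
import numpy as np

def retrieve(database, query, top=5):
    """Return indices of the `top` database rows closest to `query`.

    Distances are squared Euclidean distances between projected vectors,
    which correspond to Mahalanobis distances in the original space.
    """
    d2 = np.sum((database - query) ** 2, axis=1)
    top = min(top, len(d2))
    # Partial selection avoids sorting the whole database.
    cand = np.argpartition(d2, top - 1)[:top]
    return cand[np.argsort(d2[cand])]

rng = np.random.default_rng(0)
database = rng.standard_normal((10_000, 150)).astype(np.float32)
# A query that is a slightly perturbed copy of database entry 123.
query = database[123] + 0.01 * rng.standard_normal(150).astype(np.float32)
hits = retrieve(database, query, top=5)
```

The scan costs one pass over the database per query, so its running time grows linearly with both the number of faces and the vector dimensionality, which is why a low final dimensionality matters for retrieval.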

5. Conclusion 

This paper presents an efficient two-step metric learning
algorithm for large-scale problems with overcomplete
representations of data. The learned Mahalanobis distance
metric is highly sparse and selects only a small portion of
effective feature groups. With this metric, every instance can be
represented by a compact vector for efficient verification or
retrieval. Our future work focuses on two aspects: 1) further
relaxation of the empirical loss function to pursue lower
complexity for larger-scale problems; 2) a regularization method
that can incorporate prior knowledge easily.